Refamiliarize with Silge

John Little

2021-04-14

Find this repository: https://github.com/libjohn/workshop_textmining

Much of this review comes from the site: https://juliasilge.github.io/tidytext/

The primary package, tidytext, enables all kinds of text mining. See also this helpful free online book: Text Mining with R: A Tidy Approach by Silge and Robinson

library(janeaustenr)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.4     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(tidytext)
library(wordcloud2)


Data

We’ll look at some books by Jane Austen, an English novelist of the late 18th and early 19th centuries. Austen explored women and marriage within the British upper class. The novelist has a unique and well-earned following within literature. Her works are consistently discussed and honored. To this day, Austen’s novels are the source of many adaptations, written and on-screen. Through the janeaustenr package we can access and mine the text of six Austen novels. We can call the collection of novels a corpus (plural: corpora); each individual novel is a document within the corpus.

austen_books()
## # A tibble: 73,422 x 2
##    text                    book               
##  * <chr>                   <fct>              
##  1 "SENSE AND SENSIBILITY" Sense & Sensibility
##  2 ""                      Sense & Sensibility
##  3 "by Jane Austen"        Sense & Sensibility
##  4 ""                      Sense & Sensibility
##  5 "(1811)"                Sense & Sensibility
##  6 ""                      Sense & Sensibility
##  7 ""                      Sense & Sensibility
##  8 ""                      Sense & Sensibility
##  9 ""                      Sense & Sensibility
## 10 "CHAPTER 1"             Sense & Sensibility
## # ... with 73,412 more rows

Austen is best known for six published works:

austen_books() %>% 
  distinct(book)
## # A tibble: 6 x 1
##   book               
##   <fct>              
## 1 Sense & Sensibility
## 2 Pride & Prejudice  
## 3 Mansfield Park     
## 4 Emma               
## 5 Northanger Abbey   
## 6 Persuasion

Data Cleaning

Text mining typically requires a lot of data cleaning. In this case, we start with the janeaustenr collection, which has already been cleaned. Nonetheless, further data wrangling is required. First, we identify a line number for each line of text in each book.

Identify line numbers

original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(line = row_number()) %>%         # identify line numbers
  ungroup()

original_books
## # A tibble: 73,422 x 3
##    text                    book                 line
##    <chr>                   <fct>               <int>
##  1 "SENSE AND SENSIBILITY" Sense & Sensibility     1
##  2 ""                      Sense & Sensibility     2
##  3 "by Jane Austen"        Sense & Sensibility     3
##  4 ""                      Sense & Sensibility     4
##  5 "(1811)"                Sense & Sensibility     5
##  6 ""                      Sense & Sensibility     6
##  7 ""                      Sense & Sensibility     7
##  8 ""                      Sense & Sensibility     8
##  9 ""                      Sense & Sensibility     9
## 10 "CHAPTER 1"             Sense & Sensibility    10
## # ... with 73,412 more rows

Tokens

To work with these data as a tidy dataset, we need to restructure the data through tokenization. In our case a token is a single word. We want one-token-per-row. The unnest_tokens() function (tidytext package) will convert a data frame with a text column into the one-token-per-row format.
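A minimal sketch of what unnest_tokens() does, using a made-up two-line tibble (the toy object and its column names are illustrative, not part of the workshop data):

```r
library(tibble)
library(tidytext)

# A made-up data frame with one line of text per row
toy <- tibble(line = 1:2,
              text = c("It is a truth universally acknowledged,",
                       "that a single man in possession of a good fortune"))

# unnest_tokens(data, output, input): one lowercase word per row,
# punctuation stripped, other columns (here `line`) carried along
toy_tokens <- unnest_tokens(toy, word, text)
toy_tokens
```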


The default tokenizing mode is “words”. With the unnest_tokens() function, tokens can be: words, characters, character_shingles, ngrams, skip_ngrams, sentences, lines, paragraphs, regex, tweets, and ptb (Penn Treebank).
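For example, switching the token argument to “ngrams” yields overlapping word pairs instead of single words. A sketch on a made-up one-row tibble:

```r
library(tibble)
library(tidytext)

sentence <- tibble(text = "sense and sensibility")

# token = "ngrams" with n = 2 produces bigrams: each consecutive word pair
bigrams <- unnest_tokens(sentence, bigram, text, token = "ngrams", n = 2)
bigrams
```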

Process

  1. Identify line numbers, grouped by book (above)
  2. Make each single word a token
tidy_books <- original_books %>%
  unnest_tokens(word, text)

tidy_books
## # A tibble: 725,055 x 3
##    book                 line word       
##    <fct>               <int> <chr>      
##  1 Sense & Sensibility     1 sense      
##  2 Sense & Sensibility     1 and        
##  3 Sense & Sensibility     1 sensibility
##  4 Sense & Sensibility     3 by         
##  5 Sense & Sensibility     3 jane       
##  6 Sense & Sensibility     3 austen     
##  7 Sense & Sensibility     5 1811       
##  8 Sense & Sensibility    10 chapter    
##  9 Sense & Sensibility    10 1          
## 10 Sense & Sensibility    13 the        
## # ... with 725,045 more rows

Now that the data is in the one-word-per-row format, we can manipulate it with tidy tools like dplyr.

Stop Words

tidytext::get_stopwords()

Remove stop-words from the books.

matchwords_books <- tidy_books %>%
  anti_join(get_stopwords())
## Joining, by = "word"
matchwords_books
## # A tibble: 325,084 x 3
##    book                 line word       
##    <fct>               <int> <chr>      
##  1 Sense & Sensibility     1 sense      
##  2 Sense & Sensibility     1 sensibility
##  3 Sense & Sensibility     3 jane       
##  4 Sense & Sensibility     3 austen     
##  5 Sense & Sensibility     5 1811       
##  6 Sense & Sensibility    10 chapter    
##  7 Sense & Sensibility    10 1          
##  8 Sense & Sensibility    13 family     
##  9 Sense & Sensibility    13 dashwood   
## 10 Sense & Sensibility    13 long       
## # ... with 325,074 more rows

Join types

Customize your dictionaries

You can customize stop-words data frames, sentiment data frames, etc.

There are various stop-word dictionaries. Here we add the stop word “farfegnugen” to a custom dictionary. If Jane Austen ever used the word “farfegnugen,” that would be strange. So we will take pains not to score the sentiment of that word, whether or not the term shows up in a sentiment dictionary. That is, we will remove the word by making it part of a customized stop-words dictionary.

stopwords::stopwords_getsources()
## [1] "snowball"      "stopwords-iso" "misc"          "smart"        
## [5] "marimo"        "ancient"       "nltk"          "perseus"
stopwords::stopwords_getlanguages("snowball")
##  [1] "da" "de" "en" "es" "fi" "fr" "hu" "ir" "it" "nl" "no" "pt" "ro" "ru" "sv"
stopwords_custom <- tribble(~word, ~lexicon,
                            "farfegnugen", "custom")

stopwords_custom
## # A tibble: 1 x 2
##   word        lexicon
##   <chr>       <chr>  
## 1 farfegnugen custom
get_stopwords(source = "snowball")
## # A tibble: 175 x 2
##    word      lexicon 
##    <chr>     <chr>   
##  1 i         snowball
##  2 me        snowball
##  3 my        snowball
##  4 myself    snowball
##  5 we        snowball
##  6 our       snowball
##  7 ours      snowball
##  8 ourselves snowball
##  9 you       snowball
## 10 your      snowball
## # ... with 165 more rows
bind_rows(get_stopwords(), stopwords_custom)    # The default is "snowball"
## # A tibble: 176 x 2
##    word      lexicon 
##    <chr>     <chr>   
##  1 i         snowball
##  2 me        snowball
##  3 my        snowball
##  4 myself    snowball
##  5 we        snowball
##  6 our       snowball
##  7 ours      snowball
##  8 ourselves snowball
##  9 you       snowball
## 10 your      snowball
## # ... with 166 more rows
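The combined table can then be used like any other stop-words data frame. A self-contained sketch with a made-up three-word token table (toy_tokens is illustrative, not the workshop data):

```r
library(dplyr)
library(tibble)
library(tidytext)

stopwords_custom <- tribble(~word, ~lexicon,
                            "farfegnugen", "custom")

# A made-up token table: one snowball stop word, one custom stop word, one keeper
toy_tokens <- tibble(word = c("the", "farfegnugen", "dashwood"))

# anti_join() drops every token found in the combined stop-words table
kept <- toy_tokens %>%
  anti_join(bind_rows(get_stopwords(), stopwords_custom), by = "word")
kept
```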

Calculate word frequency

How many distinct countable words remain if we remove the snowball stop-words? There are 14,375 distinct words.

matchwords_books %>% 
  # distinct(word)
  count(word, sort = TRUE) 
## # A tibble: 14,375 x 2
##    word      n
##    <chr> <int>
##  1 mr     3015
##  2 mrs    2446
##  3 must   2071
##  4 said   2041
##  5 much   1935
##  6 miss   1855
##  7 one    1831
##  8 well   1523
##  9 every  1456
## 10 think  1440
## # ... with 14,365 more rows

Word clouds

matchwords_books %>%
  count(word, sort = TRUE) %>%
  head(100) %>% 
  wordcloud2(size = .4, shape = 'triangle-forward', 
             color = c("steelblue", "firebrick", "darkorchid"), 
             backgroundColor = "salmon")

Basic word cloud

A non-interactive word cloud.

matchwords_books %>%
  count(word) %>%
  with(wordcloud::wordcloud(word, n, max.words = 100))

Your Turn: Exercise 1

Goal: Make a basic word cloud for the novel Pride and Prejudice (pride_prej_novel)

  1. Prepare
pride_prej_novel <- tibble(text = prideprejudice) %>% 
  mutate(line = row_number())
  2. Tokenize pride_prej_novel with unnest_tokens()

  3. Remove stop-words

  4. Calculate word frequency

  5. Make a simple word cloud

Sentiment Analysis

get_sentiments()

Let’s see what positive words exist in the bing dictionary. Then, count the frequency of those positive words that exist in Emma.

positive <- get_sentiments("bing") %>%
  filter(sentiment == "positive")                    # get POSITIVE words

positive 
## # A tibble: 2,005 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abound      positive 
##  2 abounds     positive 
##  3 abundance   positive 
##  4 abundant    positive 
##  5 accessable  positive 
##  6 accessible  positive 
##  7 acclaim     positive 
##  8 acclaimed   positive 
##  9 acclamation positive 
## 10 accolade    positive 
## # ... with 1,995 more rows
tidy_books %>%
  filter(book == "Emma") %>%                        # only the book _emma_
  semi_join(positive) %>%                           # semi_join()
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 668 x 2
##    word         n
##    <chr>    <int>
##  1 well       401
##  2 good       359
##  3 great      264
##  4 like       200
##  5 better     173
##  6 enough     129
##  7 happy      125
##  8 love       117
##  9 pleasure   115
## 10 right       92
## # ... with 658 more rows

Prepare to visualize sentiment score

Match all the Austen books to the bing sentiment dictionary. Count the word frequency.

tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book)
## Joining, by = "word"
## # A tibble: 6 x 2
##   book                    n
##   <fct>               <int>
## 1 Sense & Sensibility  8604
## 2 Pride & Prejudice    8704
## 3 Mansfield Park      11577
## 4 Emma                11966
## 5 Northanger Abbey     5762
## 6 Persuasion           5674

Calculate sentiment

Algorithm: sentiment = positive - negative

Define a section of text.

"Small sections of text may not have enough words in them to get a good estimate of sentiment while really large sections can wash out narrative structure. For these books, using 80 lines works well, but this can vary depending on individual texts… – Text Mining with R

bing <- get_sentiments("bing")

janeaustensentiment <- tidy_books %>% 
  inner_join(bing) %>% 
  count(book, index = line %/% 80, sentiment) %>%                          # `%/%` = int division ; 80 lines / section
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%    # spread(sentiment, n, fill = 0)
  mutate(sentiment = positive - negative)                                      # ALGO!!!
## Joining, by = "word"
janeaustensentiment
## # A tibble: 920 x 5
##    book                index negative positive sentiment
##    <fct>               <dbl>    <int>    <int>     <int>
##  1 Sense & Sensibility     0       16       32        16
##  2 Sense & Sensibility     1       19       53        34
##  3 Sense & Sensibility     2       12       31        19
##  4 Sense & Sensibility     3       15       31        16
##  5 Sense & Sensibility     4       16       34        18
##  6 Sense & Sensibility     5       16       51        35
##  7 Sense & Sensibility     6       24       40        16
##  8 Sense & Sensibility     7       23       51        28
##  9 Sense & Sensibility     8       30       40        10
## 10 Sense & Sensibility     9       15       19         4
## # ... with 910 more rows

Viz it

janeaustensentiment %>%
  ggplot(aes(index, sentiment)) +
  geom_col(show.legend = FALSE, fill = "cadetblue") +
  geom_col(data = . %>% filter(sentiment < 0), show.legend = FALSE, fill = "firebrick") +
  geom_hline(yintercept = 0, color = "goldenrod") +
  facet_wrap(~ book, ncol = 2, scales = "free_x") 

Preparation: Most common positive and negative words

bing_word_counts <- tidy_books %>%
  inner_join(bing) %>%
  count(word, sentiment, sort = TRUE)
## Joining, by = "word"
bing_word_counts
## # A tibble: 2,585 x 3
##    word     sentiment     n
##    <chr>    <chr>     <int>
##  1 miss     negative   1855
##  2 well     positive   1523
##  3 good     positive   1380
##  4 great    positive    981
##  5 like     positive    725
##  6 better   positive    639
##  7 enough   positive    613
##  8 happy    positive    534
##  9 love     positive    495
## 10 pleasure positive    462
## # ... with 2,575 more rows

Viz it too

bing_word_counts %>%
  filter(n > 170) %>%
  mutate(n = if_else(sentiment == "negative", - n, n)) %>%
  ggplot(aes(fct_reorder(str_to_title(word), n), n, fill = str_to_title(sentiment))) +
  geom_col() +
  coord_flip() +
  scale_fill_brewer(type = "qual") +
  guides(fill = guide_legend(reverse = TRUE)) +
  labs(title = "Frequency of popular positive and negative words",
       subtitle = "Jane Austen novels",
       y = "Compound sentiment score", x = "",
       fill = "Sentiment", caption = "Source: library(janeaustenr)") +
  theme(plot.title.position = "plot")

Dictionaries

What other dictionaries are available? How to choose?

head(get_sentiments("bing"))
## # A tibble: 6 x 2
##   word       sentiment
##   <chr>      <chr>    
## 1 2-faces    negative 
## 2 abnormal   negative 
## 3 abolish    negative 
## 4 abominable negative 
## 5 abominably negative 
## 6 abominate  negative
head(get_sentiments("loughran"))
## # A tibble: 6 x 2
##   word         sentiment
##   <chr>        <chr>    
## 1 abandon      negative 
## 2 abandoned    negative 
## 3 abandoning   negative 
## 4 abandonment  negative 
## 5 abandonments negative 
## 6 abandons     negative
head(get_sentiments("nrc"))
## # A tibble: 6 x 2
##   word      sentiment
##   <chr>     <chr>    
## 1 abacus    trust    
## 2 abandon   fear     
## 3 abandon   negative 
## 4 abandon   sadness  
## 5 abandoned anger    
## 6 abandoned fear
head(get_sentiments("afinn"))
## # A tibble: 6 x 2
##   word       value
##   <chr>      <dbl>
## 1 abandon       -2
## 2 abandoned     -2
## 3 abandons      -2
## 4 abducted      -2
## 5 abduction     -2
## 6 abductions    -2
get_sentiments("nrc") %>% 
  count(sentiment, sort = TRUE) 
## # A tibble: 10 x 2
##    sentiment        n
##    <chr>        <int>
##  1 negative      3324
##  2 positive      2312
##  3 fear          1476
##  4 anger         1247
##  5 trust         1231
##  6 sadness       1191
##  7 disgust       1058
##  8 anticipation   839
##  9 joy            689
## 10 surprise       534

Afinn

What words in Emma match the AFINN dictionary?

emma_afinn <- tidy_books %>%
  filter(book == "Emma") %>% 
  anti_join(get_stopwords()) %>% 
  inner_join(get_sentiments("afinn"))
## Joining, by = "word"
## Joining, by = "word"
emma_afinn
## # A tibble: 10,159 x 4
##    book   line word         value
##    <fct> <int> <chr>        <dbl>
##  1 Emma     15 clever           2
##  2 Emma     15 rich             2
##  3 Emma     15 comfortable      2
##  4 Emma     16 happy            3
##  5 Emma     16 best             3
##  6 Emma     18 distress        -2
##  7 Emma     20 affectionate     3
##  8 Emma     22 died            -3
##  9 Emma     24 excellent        3
## 10 Emma     25 fallen          -2
## # ... with 10,149 more rows
emma_afinn %>% 
  count(word, sort = TRUE)
## # A tibble: 894 x 2
##    word       n
##    <chr>  <int>
##  1 miss     599
##  2 good     359
##  3 great    264
##  4 dear     241
##  5 like     200
##  6 better   173
##  7 hope     143
##  8 poor     136
##  9 wish     135
## 10 happy    125
## # ... with 884 more rows

Make Sections

Just as we calculated sentiment above, we make sections, this time of 80 matched words rather than 80 lines, and then calculate sentiment.

emma_afinn_sentiment <- emma_afinn %>% 
  mutate(word_count = 1:n(),
         index = word_count %/% 80) %>% 
  group_by(index) %>% 
  summarise(sentiment = sum(value))           ## ALGO sum each Afinn score in the 80 word section
## `summarise()` ungrouping output (override with `.groups` argument)
emma_afinn_sentiment
## # A tibble: 127 x 2
##    index sentiment
##    <dbl>     <dbl>
##  1     0        40
##  2     1        33
##  3     2        77
##  4     3        84
##  5     4        52
##  6     5        80
##  7     6        98
##  8     7        80
##  9     8        69
## 10     9        68
## # ... with 117 more rows

Viz it

emma_afinn %>% 
  mutate(word_count = 1:n(),
         index = word_count %/% 80) %>% 
  filter(index == 104) %>% 
  count(word, sort = TRUE) %>%
  wordcloud2(size = .4, shape = 'diamond', 
             backgroundColor = "darkseagreen")
emma_afinn_sentiment %>% 
  ggplot(aes(index, sentiment)) +
  geom_col(aes(fill = cut_interval(sentiment, n = 5))) +
  geom_hline(yintercept = 0, color = "forestgreen", linetype = "dashed") +
  scale_fill_brewer(palette = "RdBu", guide = FALSE) +
  theme(panel.background = element_rect(fill = "grey"),
        plot.background = element_rect(fill = "grey"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  labs(title = "Afinn Sentiment Analysis of _Emma_")

emma_afinn %>%
  mutate(word_count = 1:n(),
         index = as.character(word_count %/% 80)) %>%
  filter(index == 10 | index == 104 | index == 105) %>% 
  ggplot(aes(value, index)) +
  geom_boxplot() +
  # geom_boxplot(notch = TRUE) +
  geom_jitter() +
  coord_flip() +
  labs(y = "section", x = "Afinn")

Resources


John Little
Rfun
Center for Data & Visualization Sciences


CC BY-NC
Creative Commons: Attribution, Non-commercial
https://creativecommons.org/licenses/by-nc/4.0/